Skip to content

Conversation

Bouncheck
Copy link
Collaborator

Adds isolated SessionTicketsIT which checks if session tickets mechanism behaves as expected. It does that by watching Java's SSL debug logs and ensuring that specific substrings appear in them.

Since the server supports tickets with TLSv1.3, only that version is tested. In that version the session resumption without server-side state is done through pre-shared keys. The details are described in rfc8446 (see section 2.2).

@Bouncheck Bouncheck self-assigned this Aug 22, 2025
@Bouncheck
Copy link
Collaborator Author

Bouncheck commented Aug 22, 2025

Tested with Java 8, 11 and 17.
Fixes #444 . Verifies that the feature is supported. Unfortunately the feature does less than what it was expected to do.

@Bouncheck Bouncheck requested a review from dkropachev August 22, 2025 17:26
@Bouncheck Bouncheck changed the title Introduce integration test for TLS session tickets 4.x: Introduce integration test for TLS session tickets Aug 22, 2025
Copy link
Collaborator

@dkropachev dkropachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Way to test counters is prone to failure, if connection is re-estalished during the test (I have seen it happening), counters will drift and you going to get false-positive.

To test that TLS tickets work properly you need to establish connection, then break it and then check if ticket was reused, don't see it is happening here.

@Bouncheck
Copy link
Collaborator Author

Bouncheck commented Aug 22, 2025

Way to test counters is prone to failure, if connection is re-estalished during the test (I have seen it happening), counters will drift and you going to get false-positive.

I don't think the false positive (test passing when it should not) is possible here, but false negative is.
Are you opposed to all counter/logs based approaches? I can fix the drift by counting the number of seen handshakes (for example by counting ClientHellos) and making other checks depend on that number rather than hardcoded number.

To test that TLS tickets work properly you need to establish connection, then break it and then check if ticket was reused, don't see it is happening here.

There are 5 resumptions and 6 handshakes happening in this test. I can make the checks more explicit by tracking the ticket data from the logs and match it against what is reported to be received through NewSessionTicket messages.

@dkropachev
Copy link
Collaborator

Way to test counters is prone to failure, if connection is re-estalished during the test (I have seen it happening), counters will drift and you going to get false-positive.

I don't think the false positive (test passing when it should not) is possible here, but false negative is.

https://en.wikipedia.org/wiki/False_positives_and_false_negatives :

A false positive is an error in [binary classification](https://en.wikipedia.org/wiki/Binary_classification) in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test result incorrectly indicates the absence of a condition when it is actually present

Are you opposed to all counter/logs based approaches?

I’m totally fine with it, unless there’s a better way—which there isn’t.

I can fix the drift by counting the number of seen handshakes (for example by counting ClientHellos) and making other checks depend on that number rather than hardcoded number.
There are 5 resumptions and 6 handshakes happening in this test. I can make the checks more explicit by tracking the ticket data from the logs and match it against what is reported to be received through NewSessionTicket messages.

Given the fact that ChannelPool is initialized from one DriverChannel and later create per-shard DriverChannel's, it would be enough to test if Resuming session more than zero.

@Bouncheck Bouncheck force-pushed the scylla-4.x-integration-tls13 branch from 3e29572 to 2527f1e Compare August 26, 2025 00:52
@Bouncheck
Copy link
Collaborator Author

Set protocol version to V4, removed hardcoded expected number of initiated handshakes, removed some counters and made some checks more lenient.
Now this test checks if the resumptions are used at all when configured to use TLS 1.3.

I've also ran some tests with more nodes and higher smp and in general the feature is supported but probably does not work like we would want it to.

The server seems to consistently send two session tickets per connection. Setting higher smp did not seem to impact that. As was mentioned above when channel pool is created it first makes one connection to a node and then tries to add all remaining channels at once. Having a few surplus tickets is not going to help with initialization of all those connections.
Vast majority will not use PSK because there will not be any key available for use.

I have also tried to see what happens if i stop the node and start it again using ccm. Once the node is stopped all channels try to connect again and so far it seems that only one channel tries to use session ticket as if only one ticket was being kept in cache. Other connection attempts report that there is no PSK identity to use for them. In theory there should be around as many unused tickets as there are shards, since each connection receives two tickets. I did not find yet where does this result come from.
But even if there was more tickets, once the first reconnection attempts fail due to node not being back up yet, I don't think those tickets could be reused. It seems to me they are removed from cache once fetched for use, so all tickets would be wasted on attempts before node is back up anyway (if i understand correctly).

So since it does not provide any extra value I did not set higher smp or added more nodes to the test for now.

@dkropachev
Copy link
Collaborator

@Bouncheck , please update on what is going on.

@Bouncheck Bouncheck force-pushed the scylla-4.x-integration-tls13 branch 4 times, most recently from 7ca6356 to 4fd48e0 Compare August 28, 2025 11:18
@Bouncheck
Copy link
Collaborator Author

Bouncheck commented Aug 28, 2025

In regard to the test:
I've set the number of shards to 4, however it seems that the more shards is there the more unstable starting node is. Sometimes ccm fails that step.
Added second method doing similar check, but expecting all resumptions to use PSK in order to track this issue. Marked as ignored so that CI is not needlessly red for now.

In regard to the whole issue:
It does seem that on reconnection and also on addMissingChannels multiple threads attempt to resume the same session. Only one succeeds at obtaining the preSharedKey, because once the first thread calls consumePskIdentity others will fail and report "PSK has no identity, or identity was already used".
sessionCache (in sslContext.clientCache) does hold multiple surplus entries. Some have PSK set, some not. On reconnection there is plenty of entries in session cache to choose from. I'm trying to figure out if it's possible to control which session is chosen for resumption.

Adds isolated SessionTicketsIT which checks if session tickets mechanism
behaves as expected. It does that by watching Java's SSL debug logs and
ensuring that specific substrings appear in them.

Since the server supports tickets with TLSv1.3, only that version is tested.
In that version the session resumption without server-side state is done through
pre-shared keys. The details are described in rfc8446 (see section 2.2).
@Bouncheck Bouncheck force-pushed the scylla-4.x-integration-tls13 branch from 4fd48e0 to 9effa23 Compare September 3, 2025 11:15
@Bouncheck
Copy link
Collaborator Author

Bouncheck commented Sep 3, 2025

Moved some parts to separate methods to avoid repetition.
Changed the readiness check to also look at the size of the pools.
Marked the reconnection test method as expected to fail, since JDK implementation of TLSv1.3 does not seem to provide what we were looking for. I'll sum up the findings in parent issue.

The CI failure does not look to be related. This test class does not even target Cassandra and the failure is caused by the cluster version reported by the node. It contains -SNAPSHOT label for some reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants